SentRNA: Improving computational RNA design by incorporating a prior of human design strategies
Authors
Abstract
Designing RNA sequences that fold into specific structures and perform desired biological functions is an emerging field in bioengineering with broad applications from intracellular chemical catalysis to cancer therapy via selective gene silencing. Effective RNA design requires first solving the inverse folding problem: given a target structure, propose a sequence that folds into that structure. Although significant progress has been made in developing computational algorithms for this purpose, current approaches are ineffective at designing sequences for complex targets, limiting their utility in real-world applications. However, an alternative that has shown significantly higher performance is the community of human players of the online RNA design game EteRNA. Through many rounds of gameplay, these players have developed a collective library of "human" rules and strategies for RNA design that have proven to be more effective than current computational approaches, especially for complex targets. Here, we present an RNA design agent, SentRNA, which consists of a fully-connected neural network trained using the eternasolves dataset, a set of 1.8 × 10^4 player-submitted sequences across 724 unique targets. The agent first predicts an initial sequence for a target using the trained network, and then refines that solution if necessary using a short adaptive walk drawing on a canon of standard design moves. Through this approach, we observe that SentRNA can learn and apply human-like design strategies to solve several complex targets previously unsolvable by any computational approach. We thus demonstrate that incorporating a prior of human design strategies into a computational agent significantly boosts its performance and suggests a new paradigm for machine-based RNA design.
Introduction: Solving the inverse folding problem for RNA is a critical prerequisite to effective RNA design, an emerging field of modern bioengineering research.1,2,3,4,5 An RNA molecule's function is highly dependent on the structure into which it folds, which in turn is determined by the sequence of nucleotides that comprise it. Therefore, designing RNA molecules to perform specific functions requires designing sequences that fold into specific structures. As such, significant efforts have been made over the past several decades in developing computational algorithms to reliably predict RNA sequences that fold into a given target.6,7,8,9,10,11 Existing computational methods for inverse RNA folding can be roughly separated into two types. The first type generates an initial guess of a sequence and then refines the sequence using some form of stochastic search. Published algorithms that fall under this category include RNAInverse,6 RNA-SSD,7 INFO-RNA,8 NUPACK,10 and MODENA.11 RNAInverse, one of the first inverse folding algorithms, initializes the sequence randomly and then uses a simple adaptive walk in which random single or pair mutations are successively performed, and a mutation is accepted if it improves the structural similarity between the current and the target structure. RNA-SSD first performs hierarchical decomposition of the target and then performs an adaptive walk separately on each substructure to reduce the size of the search space. INFO-RNA first generates an initial guess of the sequence using dynamic programming to estimate the minimum energy sequence for a target structure, and then performs simulated annealing on the sequence. NUPACK performs hierarchical decomposition of the target and assigns an initial sequence to each substructure.
For each sequence, it then generates a thermodynamic ensemble of possible structures and stochastically perturbs the sequence to optimize the "ensemble defect" term, which represents the average number of improperly paired bases relative to the target over the entire ensemble. Finally, one of the most recent algorithms, MODENA, generates an ensemble of initial sequences using a genetic algorithm, and then performs stochastic search using crossovers and single-point mutations. The second type of design algorithm, exemplified by programs such as DSS-Opt, foregoes stochastic search and instead attempts to generate a valid sequence directly from gradient-based optimization. Given a target, DSS-Opt generates an initial sequence and then performs a gradient-based optimization of an objective function that includes the predicted free energy of the target and a "negative design" term that punishes improperly paired bases. Both types of algorithms have proven effective given simple to moderately complex structures. However, there is still much room for improvement. A recent benchmark of these algorithms showed that they consistently fail given large or structurally complex targets,12 limiting their applicability to designing RNA molecules for real-world biological applications. A promising alternative approach to RNA design that has consistently outperformed current computational methods is EteRNA, a web-based graphical interface in which the RNA design problem is presented to humans as a game.13 Players of the game are shown 2D representations of target RNA structures ("puzzles") and asked to propose sequences that fold into them. These sequences are first judged using the ViennaRNA 1.8.5 software package6 and then validated experimentally. Through this cycle of design and evaluation, players build a collective library of design strategies that can then be applied to new, more complex puzzles.
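The adaptive walk shared by RNAInverse-style methods (and by SentRNA's own refinement stage) can be sketched in a few lines. This is a minimal illustration, not the published implementation: the `fold` argument stands in for a structure-prediction call (e.g. ViennaRNA), and structural similarity is taken as the fraction of matching dot-bracket characters.

```python
import random

def structural_similarity(s1, s2):
    """Fraction of matching characters between two dot-bracket strings."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

def adaptive_walk(target, fold, n_steps=1000, alphabet="AUCG"):
    """RNAInverse-style adaptive walk: propose random point mutations and
    accept a mutation only if it improves the similarity between the
    folded structure and the target."""
    seq = [random.choice(alphabet) for _ in range(len(target))]
    best = structural_similarity(fold("".join(seq)), target)
    for _ in range(n_steps):
        if best == 1.0:          # target structure reached
            break
        i = random.randrange(len(seq))
        old = seq[i]
        seq[i] = random.choice(alphabet)
        score = structural_similarity(fold("".join(seq)), target)
        if score > best:
            best = score
        else:
            seq[i] = old          # reject: revert the mutation
    return "".join(seq), best
```

Real implementations differ mainly in how the initial sequence is chosen and in the move set (single vs. paired mutations), but the accept-if-improved loop is the common core.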
These strategies are distinct from those employed by design algorithms such as DSS-Opt and NUPACK in that they are honed through visual pattern recognition and experience. Remarkably, these human-developed strategies have proven more effective for RNA design than current computational methods. For example, EteRNA players significantly outperform even the best computational algorithms on the Eterna100, a set of 100 challenging puzzles designed by EteRNA players to showcase a variety of RNA structural elements that make design difficult. While top-ranking human players can solve all 100 puzzles, even the best-scoring computational algorithm, MODENA, could only solve 54 / 100 puzzles.12 Given the success of these strategies, we decided to investigate whether incorporating these strategies into a computational agent can increase its performance beyond that of current state-of-the-art methods. In this study, we present SentRNA, a computational agent for RNA design that significantly outperforms existing computational algorithms by learning human-like design strategies in a data-driven manner. The agent consists of a fully-connected neural network that takes as input a featurized representation of the local environment around a given position in a puzzle. The output is a length-4 vector, corresponding to the four RNA nucleotides (bases): A, U, C, and G. The model is trained using the eternasolves dataset, a custom-compiled collection of 1.8 × 10^4 player-submitted solutions across 724 unique puzzles. These puzzles comprise both the “Progression” puzzles, designed for beginning EteRNA players, as well as several “Lab” puzzles for which solutions were experimentally synthesized and tested. During validation and testing the agent takes an initially blank puzzle and assigns bases to every position greedily based on the output values.
If this initial prediction is not valid, as judged by ViennaRNA 1.8.5, it is further refined via an adaptive walk using a canon of standard design moves compiled by players and taught to new players through the game's puzzle progression. Overall, we trained and tested an ensemble of 165 models, each using a distinct training set and model input (see Methods). Collectively, the ensemble of models can solve 42 / 100 puzzles from the Eterna100 by neural network prediction alone, and 80 / 100 puzzles using neural network prediction + refinement. Among these 80 puzzles are all 15 puzzles highlighted during a previous benchmark by Anderson-Lee et al.12 Notably, among these 15 are 7 puzzles that no other computational algorithm has yet solved. This study demonstrates that teaching human design strategies to a computational RNA design agent in a data-driven manner can lead to significant increases in performance over previous methods, and represents a new paradigm in machine-based RNA design in which both human and computational design strategies are united into a single agent. Methods: Code availability: The source code for SentRNA, all our trained models, and the full eternasolves dataset can be found on GitHub: https://github.com/jadeshi/SentRNA. Hardware: We performed all computations (training, validation, testing, and refinement) using a desktop computer with an Intel Core i7-6700K @ 4.00 GHz CPU and 16 GB of RAM. Creating 2D structural representations of puzzles: During training and testing of almost all models, we used the RNAplot function from ViennaRNA 1.8.5 to render puzzles as 2D structures given their dot-bracket representations.
However, when training and testing two specific models M6 and M8 on two highly symmetric puzzles, “Mat Lot 2-2 B” and “Mutated chicken feet” (see Results and Discussion), we decided to use an in-house rendering algorithm (hereafter called EteRNA rendering) in place of RNAplot, as we found that RNAplot was unable to properly render the symmetric structure of the puzzles. Neural network architecture: Our goal is to create an RNA design agent that can propose a sequence of RNA bases that folds into a given target structure, i.e. fill in an initially blank EteRNA puzzle. To do this, we employ a fully connected neural network that assigns an identity of A, U, C, or G to each position in the puzzle given a featurized representation of its local environment. During test time, we expose the agent to every position in the puzzle sequentially and have it predict that position's identity. The neural network was implemented using TensorFlow14 and contains three hidden layers of 100 nodes with ReLU nonlinearities. The output is a length-4 vector, corresponding to the four RNA bases: A, U, C, and G. During validation and test time, base identities are assigned greedily to the puzzle based on these output values. Given a position x in the puzzle, the input for this position to the agent is a combination of information about its bonding partner, nearest neighbors, and long-range features, which can include, for example, next nearest neighbors or adjacent closing pairs in a multiloop. While the bonding partner and nearest neighbor information is provided to the agent by default, long-range features are learned through the training data. The information about the bonding partner is encoded as a length-5 vector, with each position in the vector representing either A, U, C, G, or "none" (i.e. a blank position that does not have a base assigned to it yet). A value of 1 is assigned to the position corresponding to the identity of the bonding partner, while all other values are set to 0.
If there is no bonding partner, all values are set to 0. The nearest neighbor information is encoded as a length-11 vector: a combination of two length-5 one-hot vectors corresponding to the identities of the bases directly before and after it in the sequence, and a single value that corresponds to the angle in radians formed by the base and its nearest neighbors. This angle serves to distinguish bases belonging to different substructures in the puzzle. For example, a base situated in the middle of a large internal loop will have a larger angle than a base positioned in a 4-loop. As a design choice, any position in the middle of a stack of bonded bases was assigned an angle of 0. Also, if the model is looking at either the first or last position in the puzzle, the “before” or “after” nearest-neighbor portion of the input, respectively, is set to 0. Long-range features refer to important positions y in the puzzle relative to x that the agent should also have knowledge of when deciding what base to assign to x. These are each defined by a set of two values: 1) the Cartesian distance, L, between x and y in the puzzle given the 2D rendering of the puzzle, and 2) the angle in radians, Φ, formed by positions x – 1, x, and y. These two values are stored in a list, [L, Φ], and serve as a label for the feature. For example, when using RNAplot, a label of [15.0, 1.6] corresponds to a base's bonding partner in the middle of a stem. The bonding distance is equal to 15.0 RNAplot distance units, and the angle between the previous base in the stem, the current base, and the bonding partner is 1.6 radians, or 90 degrees. A length-5 vector of zeros is then appended to the input vector to serve as a placeholder for the feature.
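The default (non-long-range) part of the input described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names (`one_hot`, `featurize`) are ours, and the bonding partners and loop angles are assumed to be precomputed from the 2D rendering.

```python
BASES = ["A", "U", "C", "G", None]  # None = blank / unassigned position

def one_hot(base):
    """Length-5 one-hot over A, U, C, G, and 'none' (blank)."""
    v = [0.0] * 5
    v[BASES.index(base)] = 1.0
    return v

def featurize(seq, pairs, angles, x):
    """Base input vector for position x:
    bonding partner (5) + before neighbor (5) + after neighbor (5)
    + nearest-neighbor angle (1) = 16 values.
    seq holds None for unassigned positions; pairs maps a position to
    its bonding partner (absent if unpaired); angles gives the angle
    in radians formed by x and its nearest neighbors (0 in a stack)."""
    partner = pairs.get(x)
    partner_vec = [0.0] * 5 if partner is None else one_hot(seq[partner])
    before = one_hot(seq[x - 1]) if x > 0 else [0.0] * 5
    after = one_hot(seq[x + 1]) if x < len(seq) - 1 else [0.0] * 5
    return partner_vec + before + after + [angles[x]]
```

Long-range features then extend this vector with one length-5 placeholder per feature, filled in when a matching position is found.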
During training, validation, or testing, when the agent is looking at a given position x, it computes L and Φ between x and other positions yi in the puzzle, and if both L and Φ match those of a long-range feature used in the model within some threshold, a 1 is assigned to the position of the placeholder corresponding to the identity of yi (A, U, C, G, or “none”). We set the threshold for both L and Φ to 0.1 (distance units and radians, respectively) when using RNAplot, and a stricter value of 10^-5 for both L and Φ when using EteRNA rendering. We determine what long-range features to use (i.e. which features should be considered "important") using a mutual information metric over player solutions. First, we perform a pairwise mutual information calculation using all the player solutions for a given puzzle to form an l × l mutual information matrix, where l is the length of the puzzle. We then select the top M (user-defined) positions in the matrix with highest mutual information, and for each of these positions (x, y) compute L and Φ to give a list of long-range features for the puzzle (Figure 1). This process is repeated for each puzzle, and the unique long-range features across all the puzzles are then combined into an aggregate list of long-range features. A random subset of N (user-defined) features is then selected from this list and used to define a model to be used for training, validation, and testing. By defining long-range features using a mutual information metric, our goal is to impose a prior of human knowledge onto the agent. High mutual information between positions x and y indicates that the identity of position x is strongly correlated to that of position y. In other words, when EteRNA players are choosing in-game what to assign for x, they are typically first looking at y, or vice versa. Therefore, by only including positions with high mutual information in the agent's field of vision, the agent prioritizes what humans have deemed to be important.
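The pairwise mutual information step described above can be sketched directly from the standard definition MI(i, j) = Σ p(a, b) log [p(a, b) / (p(a) p(b))], estimated from the aligned player solutions for one puzzle. The function names here are illustrative, not from the SentRNA codebase.

```python
import math
from collections import Counter

def mutual_information_matrix(solutions):
    """l x l pairwise mutual information over aligned player solutions.
    solutions: list of equal-length sequence strings for one puzzle."""
    l = len(solutions[0])
    n = len(solutions)
    mi = [[0.0] * l for _ in range(l)]
    for i in range(l):
        for j in range(i + 1, l):
            joint = Counter((s[i], s[j]) for s in solutions)
            pi = Counter(s[i] for s in solutions)
            pj = Counter(s[j] for s in solutions)
            val = sum((c / n) * math.log((c / n) / ((pi[a] / n) * (pj[b] / n)))
                      for (a, b), c in joint.items())
            mi[i][j] = mi[j][i] = val
    return mi

def top_pairs(mi, m):
    """Top-m position pairs (i < j) by mutual information."""
    l = len(mi)
    scored = [(mi[i][j], i, j) for i in range(l) for j in range(i + 1, l)]
    scored.sort(reverse=True)
    return [(i, j) for _, i, j in scored[:m]]
```

A perfectly correlated pair of positions (e.g. the two halves of a base pair that players always set together) yields MI = log 2 under this estimator, while independent positions yield values near 0, which is why puzzles with few solutions make the estimate noisy.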
As a result, we provide the model with enough information to prevent underfitting and enable it to apply its learned strategies to more difficult puzzles. On the other hand, we also limit the model complexity such that we can train the model using a relatively small number of training examples without overfitting. For example, all models in this study were trained using a maximum of 50 player solutions. Training algorithm: We used subsets of the first 721 / 724 puzzles from eternasolves to train the model, puzzles 722 and 724 for initial validation and testing respectively, and the Eterna100 for more extensive testing. Puzzle 723 was skipped due to being completely unstructured and not useful for validation. We confirmed that there was no contamination between the training, validation, and test sets. Because there is no straightforward way to determine a priori what long-range features and training examples will result in the best-performing model, we decided to train and test an extensive ensemble of models. To do this, we first computed an aggregate list of 42 long-range features using all puzzles from eternasolves with at least 50 submitted solutions, allowing each puzzle to contribute only one long-range feature (M = 1). We set this threshold of 50 player solutions since puzzles with a small number of solutions can introduce noise into the mutual information calculation. We then randomly selected a subset of long-range features from the aggregate list and built a model using these features. We built one model each using between 0 and 42 randomly chosen features, and repeated this process 20 times to build a total of 43 × 20 = 860 models. To form the training sets for these models, we first randomly chose 50 puzzles from eternasolves to serve as training puzzles. Each model then randomly sampled one solution from each of these puzzles to give a training set of 50 player solutions.
In addition to the above 860 models, we built another 256 models that scanned more widely over hyperparameters. For these models, we computed the aggregate list of long-range features using all puzzles, not just those with 50 player solutions or more, and loosened the restriction on the number of long-range features that could be contributed by each puzzle (M = 1 to 4). This expanded the number of features included in the models to a maximum of 312. For each model, we picked a random set of 50 puzzles from eternasolves to serve as the training puzzles, and also allowed the model to select multiple player solutions from each puzzle to form the training set (n = 1 to 8). To train each model, we use the following procedure. For each player solution in our training set, we first visually render the corresponding puzzle using RNAplot. We then set the identity of every position in the puzzle to the corresponding base in the player solution, and featurize each position into bonding pair, nearest neighbors, and long-range features to form the input vector. The output label is set to the identity of the corresponding position in the player solution. Then, we decompose the player solution into a "solution trajectory" to teach the agent how to solve an initially blank puzzle with no bases assigned (i.e. during test time). This is done by first removing all base assignments from the puzzle. A position in the puzzle is then selected and featurized. All input features (bonding partner, nearest neighbors, long-range features) are at this point set to “none”, and the output label is set to the identity of the corresponding position in the player solution. This position is then filled in with the player solution base, and the next position is picked and featurized (Figure 1). This process continues until all positions in the puzzle have been featurized. The order in which puzzle positions are filled in during this process can be either sequential or stochastic (user-defined). 
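The decomposition of a player solution into a "solution trajectory" can be sketched as follows. This is a simplified illustration: the `featurize` callable stands in for the full input construction (bonding partner, nearest neighbors, long-range features), and the function name is ours.

```python
import random

def solution_trajectory(solution, featurize, order="sequential"):
    """Decompose a player solution into (features, label) training pairs
    that mimic filling in an initially blank puzzle base by base.
    featurize(partial, x) builds the input vector for position x from
    the current partial assignment, where None marks a blank position."""
    n = len(solution)
    positions = list(range(n))
    if order == "stochastic":
        random.shuffle(positions)   # user-defined fill-in order
    partial = [None] * n
    examples = []
    for x in positions:
        # Featurize against the current partial puzzle, then fill in
        # the player's base before moving to the next position.
        examples.append((featurize(partial, x), solution[x]))
        partial[x] = solution[x]
    return examples
```

The first training pair thus sees an entirely blank puzzle, and each subsequent pair sees one more filled-in base, mirroring how a player solves a puzzle from scratch.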
By doing this, we are mimicking the process of a human player filling in the puzzle base by base and training the model to reproduce these steps. During validation and testing, the agent proceeds through each position in the (initially blank) puzzle sequentially and assigns bases greedily based on the model outputs. For our first batch of 860 models, we only used sequential fill-in. However, for the second batch of 256 models, we allowed both sequential and stochastic fill-in. We initialize each model using Gaussian weights (μ=0, σ=0.02), unit biases, and a learning rate of 0.001. We train each model using the Adam optimizer15 for a total of 1000 epochs, performing a validation on puzzle 722 every 100 epochs, to give a total of 10 candidate models. The model with the highest validation accuracy is then used for testing on puzzle 724. During validation and testing, we allowed the model two attempts at predicting a sequence, once using a blank sequence as input, and again using the initial model-predicted sequence as input. The second attempt is intended as an opportunity for the model to refine its first prediction. If the model proposed valid solutions for both validation and test puzzles, it was then subjected to more extensive testing on the Eterna100. Finally, we built and trained two models using hand-picked training puzzles from eternasolves to allow the model to learn rarely seen, advanced strategies necessary to solve two puzzles from the Eterna100, Mat – Lot 2-2 B and Mutated chicken feet. These models were not tested on the full Eterna100, but only on their respective puzzles. In total, we trained 860 + 256 + 2 = 1118 models, tested 163 of these models on the full Eterna100, and tested 2 models specifically on Mat – Lot 2-2 B and Mutated chicken feet, for a total of 163 + 2 = 165 models tested.
Figure 1: The training and validation procedure for SentRNA consists of first selecting a puzzle, and then performing a pairwise mutual information calculation using all player solutions for that puzzle. Positions in the resulting mutual information matrix with high values are then used to define new long-range features that will be included in the model’s field of vision (steps 1-3). These features are appended to the base input vector that by default has information about the bonding partner and nearest neighbors. SentRNA is then trained to reproduce a player solution at each position in the puzzle. To train the model, we use a two-part training set consisting of both the full player solution as well as a synthetic “solution trajectory” to simulate the process of solving a puzzle base-by-base starting from a blank puzzle (step 4). During validation and testing, the model is exposed to each position in a new (initially blank) puzzle sequentially and greedily fills in bases one-by-one based on the model outputs (step 5). Refinement algorithm: During testing on the Eterna100, if the initial predicted solution does not fold into the target structure, as judged by ViennaRNA 1.8.5, we further refine this solution using an adaptive walk. We use the following refinement moves: 1) pairing two unpaired bases that should be paired in the target structure, 2) re-pairing two paired bases that should be paired, 3) unpairing two paired bases that should not be paired, and 4) G boosting or U-U boosting,16 two common stabilization strategies taught to beginning EteRNA players. During refinement, random trajectories of these moves are generated and applied to the initial sequence until one that folds into the target structure is reached.
At any point, if an intermediate sequence is reached that folds into a structure more structurally similar to the target (structural similarity being defined as the fraction of matching characters between the two dot-bracket notations), the refinement trajectory ends and all subsequent trajectories begin from that point. Unless otherwise noted (see Results and Discussion), we limited the refinement to 300 trajectories of length 30, which takes at most 90 seconds for most puzzles in the Eterna100 (~100 bases or fewer in length). By comparison, all algorithms tested in the previous benchmark by Anderson-Lee et al. were given a much longer time limit of 24 hours.12 We also investigated the ability of the refinement algorithm to solve puzzles on its own without the benefit of the initial neural network prediction by repeating the refinement process for all the Eterna100 puzzles while starting from initial sequences generated by two common initialization methods used in previous computational algorithms (see Results and Discussion, Table 2). Testing statistical robustness of results: Over the course of testing models on the Eterna100, we noticed specific models that seemed particularly capable at solving difficult puzzles, solving them quickly after only a few minutes of refinement (see Results and Discussion). This leads to the apparent conclusion that these models are well-trained and provide useful prior information toward solving these puzzles. However, an alternative explanation is that by cherry-picking and looking at only a successful model, we are ignoring the possibility that the "success" of that one model may simply be by chance, the result of subjecting poorly trained models to many independent rounds of refinement. Therefore, to establish a quantitative metric of performance, for each model we will discuss, we repeated the refinement from the same initial model prediction n times and defined the refinement efficacy Eref as the fraction of times the puzzle is successfully solved.
We also estimated the expected refinement time Tref by dividing the cumulative time tall taken for all trials by the number of successes, nsuccess. This represents the average amount of unproductive time elapsed before a valid solution is sampled. Error bars for Tref can be computed from the standard error for Eref, which, for n independent trials, is the standard error of a binomial proportion: SE(Eref) = sqrt(Eref(1 − Eref) / n).
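The two summary statistics and the error estimate above reduce to a few lines of arithmetic. The function name is illustrative, not from the SentRNA codebase.

```python
import math

def refinement_stats(outcomes, times):
    """Refinement efficacy E_ref, expected refinement time T_ref, and the
    standard error of E_ref over n independent refinement trials.
    outcomes: list of booleans (puzzle solved or not, per trial)
    times: per-trial wall-clock times (same length as outcomes)"""
    n = len(outcomes)
    n_success = sum(outcomes)
    e_ref = n_success / n                         # fraction of successes
    t_all = sum(times)                            # cumulative trial time
    t_ref = t_all / n_success if n_success else float("inf")
    se = math.sqrt(e_ref * (1.0 - e_ref) / n)     # binomial standard error
    return e_ref, t_ref, se
```

For example, 2 successes out of 4 trials of 10 s each gives Eref = 0.5, Tref = 20 s, and SE(Eref) = 0.25.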
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
Publication date: 2018